Predictive modelling of Parkinson’s disease progression based on RNA-Sequence with densely connected deep recurrent neural networks

The advent of recent high throughput sequencing technologies resulted in unexplored big data of genomics and transcriptomics that might help to answer various research questions in Parkinson’s disease (PD) progression. While the literature has revealed various predictive models that use longitudinal clinical data for disease progression, there is no predictive model based on RNA-Sequence data of PD patients. This study investigates how to predict the PD Progression for a patient’s next medical visit by capturing longitudinal temporal patterns in the RNA-Seq data. Data provided by Parkinson Progression Marker Initiative (PPMI) includes 423 PD patients without revealing any race, sex, or age information with a variable number of visits and 34,682 predictor variables for 4 years. We propose a predictive model based on deep Recurrent Neural Network (RNN) with the addition of dense connections and batch normalization into RNN layers. The results show that the proposed architecture can predict PD progression from high dimensional RNA-seq data with a Root Mean Square Error (RMSE) of 6.0 and a rank-order correlation of (r = 0.83, p < 0.0001) between the predicted and actual disease status of PD.

www.nature.com/scientificreports/ to other conventional machine learning techniques [3][4][5][6][7][8][10][11][12] . Eliza Courtney et al. 12 investigated the usefulness of RNA-Sequence data in the research of neurodegenerative diseases such as Alzheimer's disease, Parkinson's disease, and Huntington's disease. We know that neurodegenerative diseases are characterized by a significant change in gene expression levels and splicing patterns that occur before the onset and during the progression of the disease [1][2][3][4][5][6][7][8]10,12 . Hence, the authors called for a need to investigate RNA-Sequence data in these diseases as RNA-Sequence has the advantage of digital expression profiling and can identify alternative splicing patterns that have become a major focus of degenerative disease research 12 . It is important to predict disease severity or progression accurately in Parkinson's Disease. The usefulness of transcriptome data, such as RNA-Sequence, for the classification of Parkinson's Disease, however, has not been thoroughly studied. For disease prediction problems, the traditional machine learning algorithms predict the disease status by aggregating the longitudinal features rather than leveraging through the temporal patterns 13 . To address this problem of capturing patterns from longitudinal data, Wang et al. 13 aimed to employ a two-layered RNN network of Long Short Term Memory (LSTM) to predict Alzheimer's Disease (AD) status for a patient based on the historical "clinical" information of patients such as demographic data, health history, physical examination, etc. in a classification framework. However, for Parkinson's disease progression, we believe there is no study that leverages the longitudinal temporal "transcriptomic" information of historical visits to predict the future disease status in a regression framework.
In this study, we explored whether artificial neural networks, particularly RNN, can forecast the pace of Parkinson's disease progression in the early stages of the condition. The result of this study may contribute to identify an RNN architecture that will be suitable for learning the longitudinal temporal patterns in high dimensional RNA-Sequencing data.

Materials and methods
Dataset description, analysis and pre-processing. The disease progression of Parkinson's disease that is well represented by the Movement Disorder Society-sponsored unified Parkinson's disease rating scale (MDS-UPDRS) 14 reflects the motor symptoms and clinometric properties of PD on a scale of 0 to 272 with 0 being normal and 272 being severe motor and non-motor decline. We aim to predict this MDS-UPDRS score of a patient for the immediate future hospital visit by considering the RNA sequence data of multiple previous visits. The datasets for this study were provided by PPMI 15 , which is sponsored by The Michael J. Fox Foundation. The PPMI Data Repository contains de-identified clinical, imaging, and omics data gathered as part of the PPMI project, an ongoing longitudinal investigation that was started in 2010 with the goal of locating biomarkers for the progression of Parkinson's disease. The PPMI has done a longitudinal study on 423 PD subjects who were drug naïve (i.e., not much treated with dopaminergic medications) and who have been diagnosed with PD for a period of less than 2 years at the time of enrollment. As we aim to build a disease progression tool for early diagnosed PD patients, the subjects being drug naïve is hypothesized to be beneficial. The dataset access and methods for the study were approved by the PPMI. All methods were performed in accordance with the relevant guidelines and regulations by the PPMI. Informed consent was obtained from all subjects and/or their legal guardians from the PPMI without revealing any race, sex, or age information. The PPMI subjects were followed up for a period of 4 years from the date of enrollment and the clinical data was collected at uniform intervals of 3 months, 6 months, 9, months, 12 months, 18 months, 24 months, 30 months, 36 months, 42 months, 48 months after the date of baseline visit. The PPMI data includes a series of assessments (motor and non-motor assessments), Omics data (including DNA Genotyping, RNA Sequencing), and biofluids (plasma, serum, whole blood, urine, and saliva), general neurological and physical examination, etc. For our analysis, we use only a subset of the PPMI dataset that includes transcriptomic data, motor assessment questionnaire, MDS-UPDRS score, demographics, etc.
In PPMI, not all the feature's data was collected for all the visits. For example, RNA-Seq is performed for only selected visits such as Baseline visit, 6-month visit, 12-month visit, 36-month visit, and 48 months visit whereas Motor Assessment questionnaires were performed for all the visits. Hence, we need to perform a merge operation of these datasets and select the samples for whom all the clinical data is available. We also created a few variables that are derived from primary features. In total, the dimension of the input features for our study is 34,682 that includes categorical and continuous data types of transcriptomic data (RNA-Sequence) and nontranscriptomic data (MDS-UPDRS Questionnaire I, II, III, IV, demographics, general exam, etc.). The target variable is the MDS-UPDRS final score that is a continuous variable. The complete list of features used for our analysis is presented in Supplementary Table S1.
For our analysis, only those samples for whom MDS-UPDRS-III performed in the "OFF" medication state were selected to form the disease status target variable. For a disease progression prediction problem, we need to have the motor and nonmotor condition of a subject while he/she was not on any dopaminergic medication. Also, the number of visits varies with patients in the dataset with an average number of 3.2 visits per patient. The maximum number of visits available is 5 visits and the minimum number of visits available is 1 visit. The detailed sample selection criteria for our analysis are presented in Supplementary Table S2 year's data are used as input. In order to estimate the disease status in the near future, the model takes advantage of the temporal trends in the historical RNA sequence data as we advance through time. This strategy has been previously used for disease progression in Alzheimer's Disease 13 . Given there are N annual visits for a subject, the input at time t <n> where n = 1, 2, 3...N is a feature vector denoted by X <n> . The feature vector X <n> has a Q number of features that combine both RNA-Seq features and non-RNA features represented at the nth visit of the patient. For example, X <1> is a feature vector with Q features at the first visit of the patient. Further, if the " < N > " visit represents the current visit of the subject, then the " < N + 1 > " visit represents the immediate next year's visit. The MDS-UPDRS final score at the Nth is represented by y <n> and the future value of MDS-UPDRS is represented by y <n+1>, then the proposed predictive model in its most basic form can be written as follows: where the output ŷ <n+1> is the predicted value for the MDS-UPDRS score of the next visit of the patient and the function f(.) represents the core block of our proposed model. The Eq. (1) is modified further to include the ground-truth value of MDS-UPDRS scores of the current and previous visits. For example, to predict the MDS-UPDRS score of (N + 1)th visit, we input the model with the MDS-UPDRS score of Nth visits and all the previous visits (N-1)th, (N-2)th,…., 1st. Further, we add the time interval between two consecutive visits. Let the time interval between the first visit and the immediate second visit be denoted by ∆t <1> then the time interval between Nth visit and (N + 1)th can be denoted by ∆t <n> . The final version of our proposed model is as follows: The model in Eq. (2) can be represented pictorially in Fig. 1 RNN model for PD progression. The novelty of our proposed architecture is in combining RNN with the recent advancements done on Convolution Neural Networks (CNN) such as dense connections 16 and batch normalization 17 . The introduction of dense connections showed several advantages in CNNs such as strengthening feature propagation, reduction in the number of parameters in the model, facilitating training with fewer data samples and encouraging feature reuse [16][17][18][19] . Additionally, batch normalization led to the speed up of the training and allowed the networks to generalize better by reducing the internal covariate shift. Though batch normalization was primarily introduced in CNNs, it was also employed into RNNs in the context of language modeling, question answering 18 , sentiment classification 19 . Motivated by the above improvements in CNNs and considering the nature of genetics data that is characterized by fewer data samples, huge feature size, and variable-length visits of patients, we hypothesize that the densely connected batch normalized RNNs may help in learning granular details from RNA-Seq data for PD disease progression. To the best of our knowledge, the densely connected RNNs are not employed in the disease progression of PD. Motivated by DenseNet 16 , we propose a densely connected RNN architecture that is composed of multiple Dense Blocks (DBs) stacked on top of each other such that the input of every DB is connected to the output of every other DB in a feed-forward   20 , we define H t (.) as a composite block containing an RNN layer followed by a batch normalization layer. We apply batch normalization to the output activations of the RNN layer at time step t. The output of the composite block H t at time step t can be written as follows: where here γ andβ are the model parameters that will be learned using backpropagation. E ĥ and Var ĥ are the sample mean and variance that is estimated on the current mini-batch. x <t> is the vector input at the current time step t whereas h <t−1> is the vector output of the cell state of the same RNN hidden layer from the previous time step. The function f (.) depends upon the type of the cell in the RNN layer viz. LSTM, Gated Recurrent Units (GRU), or Vanilla. A simple CB may be represented in Fig. 2.
Dense Connectivity of several CBs: Multiple CBs are stacked on top of each other such that the input of every CB is connected to the output of every other CB in a feed-forward fashion to form one Dense Block (DB). The connection is defined by a concatenation operator. A densely connected CB can be shown in Fig. 3.
Dense Connectivity of several DBs: Multiple DBs are stacked on top of each other such that the input of every DB is connected to the output of every other DB in a feed-forward fashion to form the proposed Densely connected RNN architecture for PD progression using RNA-Sequence data. An architecture with 3 Dense Blocks is shown in Fig. 4.

Evaluation metrics.
We know that PD is a progressive disease and usually the disease status of the patient will either worsen by next year's visits or would remain the same. In rare cases, the disease status would become better, but we know this will not happen until an effective treatment is given to the patient which is currently not available. To a patient, he/she might be concerned with how accurate the predicted value of the future year's MDS-UPDRS score is. We know that the model can be utilized such that when RNA-Seq of the 1st visit is given as input to the model, it predicts the 2nd year's disease status score. And when the first year's and second's RNA-Seq information is given as inputs, it predicts the 3rd year's visit disease status score. Likewise, to predict the 4th year's visit disease status, the model takes the past information of 1st, 2nd, and 3rd year and the 5th year prediction follows the same pattern. The RMSE produced by the model in predicting the MDS-UPDRS score for all the prediction time points is averaged to give the metric called Progression Identification Error (PIE). In addition, the rank order correlation between the actual and predicted MDS-UPDRS is averaged for all prediction time points and is defined as a Progression Identification Correlation (PIC). A mathematical definition of the PIE and PIC is given in the Supplementary material.
To assess if the model can properly predict MDS-UPDRS score as well as whether providing the model with data from multiple historical visits has an impact on the model's capacity to catch trends from past visits. Let us www.nature.com/scientificreports/ understand this evaluation scheme with an example. Consider a patient with a total number N visits. For predicting MDS-UPDRS for the (N + 1)th visit, the model may be provided with only most recent visit's data i.e., at Nth time point. In this case, the data at (N-1)th, (N-2)th,…,2nd,1st visits are not given to the model and the RMSE say RMSE <N> for such a prediction is noted. In the second case, to predict the same (N + 1)th disease status, we now provide data from the past two visits at Nth and (N-1)th and the RMSE <N,N-1> is noted. In the third case, to predict the disease status at (N + 1)th , we provide the data from last three visits i.e., at Nth, (N-1)th and (N-2)th and the RMSE <N,N-1,N-2> is noted. Similarly, we note RMSE <N,N-1,N-2, N-3> and proceed to find further RMSEs until we reach RMSE <N,N-1,N-2, N-3,…0.1> that is calculated by the model that is given the data from all the historical visits.

Results
In this study, we implemented our proposed densely connected RNN model with 256 Vanilla RNN cells in each composite block. A total of 4 such composite blocks form one dense block. The network is composed of 3 such dense blocks. The loss function was a mean square error and the Nadam optimizer was used in the loss function optimizer 21 . L2 regularization was applied on weights in the RNN layers. A learning rate schedule was employed to reduce the learning rate by a factor of 1/5 for every 10 epochs if there is no improvement in the validation loss. The training happened in a mini-batch fashion with a batch size of 16 subjects where each subject had the same number of time sequence/visits data. The architecture is named as "Dense Vanilla". We performed fivefold Cross-Validation (CV) on the test dataset to do the comparison in performance between the proposed model and the baseline models, and the model Dense-Vanilla achieved an RMSE of (mean = 6.01, standard deviation = 0.41) in predicting the MDS-UPDRS score and showed a rank order Correlation of (mean = 0.83, standard deviation = 0.02, p < 0.0001) between the predicted MDS-UPDRS and the true MDS-UPDRS. The metrics are reported in mean and standard errors in Table 1. We compare the performance of  . These methods has been previously used for predictive modeling of AD progression 13 and thus can be safely adopted for PD progression. We adopt a commonly used training strategy for aggregating features of all patients' historical visits to train these baseline models 13,22 . That is, numerical, ordinal and nominal features across different visits are aggregated respectively with their corresponding mean, median and mode computed over all historical N visits. Details of this step is provided in the supplementary material.
To test if the improvement in the performance of our proposed model was statistically significant as shown in Fig. 5 as compared to the comparison methods, the metrics obtained from 5-Fold CV were given to the student's t-test with a significance level of alpha = 0.05, degrees of freedom = 4. The results are described in Table 2.
In addition, to analyze the impact of various feature categories on the model performance, we conducted experiments with and without non-transcriptomic feature categories (Motor Assessment, Subject Characteristics, General Exam, Age, Time Interval) and the results are provided in Table 3 and (Fig. 6).

PIE (RMSE)
DECISION TREES Figure 5. Comparison of Performance. Table 2. Statistical significance test for model comparison.

Hypothesis testing t-Test on PIE(RMSE) t-Test on PIC(Correlation)
Our www.nature.com/scientificreports/ Effect of dense connections. We would like to see the performance of the architecture by changing the RNN cell to GRU and LSTM, keeping all the other hyperparameters the same. Furthermore, we compare the performance of densely connected RNN models with that of models with plain configurations. Here plain configuration means that no dense connections were introduced either between Composite Blocks or between Dense Blocks. All the other hyperparameters (structural, optimization, and regularization) were kept the same to provide a fairground for comparison. As shown in Table 4, Dense Connections consistently provide improvements for different types of RNN cells.
As illustrated in Table , there is a significant difference in the RMSE obtained by models with dense Connections and models without dense Connections. We observe the behavior for all three categories of RNN cell types (Vanilla RNN, GRU, and LSTM). In each of the categories, the model with dense connections has better Table 3. Results of the proposed dense vanilla model with and without non-transcriptomic input features.

Evaluation metrics PIE (RMSE) PIC (Correlation)
RNA-Seq with non-transcriptomic features 6.01 ± 0.185 0.83 ± 0.01 RNA-Seq features only 6.87 ± 0.265 0.71 ± 0.03 Figure 6. Predicted and Ground Truth disease progression curves for Test Patients. Each subplot belongs to one test patient. The curve in blue is the ground truth curve and the curve in orange is the one predicted by our Predictive Model. The x-axis has discrete visit points of 6 months(V02), 12 months (V04), 24 months (V06), 36 months (V08), 48 months (V10) and the y-axis has the MDS-UPDRS score for the patient. Effect of batch normalization. We observed that the proposed composite function block CB plays an important role in helping the densely connected RNN models to learn. The Batch Normalization layer on top of the RNN layer in a CB block is integral to the successful learning of our models. We conducted experiments where the normalization layer in the CB block was removed and the remaining model configurations remained the same. Table 5 shows the performance comparison between the models that are identical to each other in all aspects except the batch normalization layer in the CB block.
In each of the categories of RNN cell types (Vanilla RNN, GRU, and LSTM), the model with batch normalization has lower RMSE in predicting the MDS-UPDRS as compared to the model without batch normalization. Furthermore, a higher rank order correlation in predicting disease status is obtained when the models have batch normalization.
Ethical approval. This study was funded by the NSERC CREATE funded "Biomedical Engineering Smartphone Training" program (CREATE-BEST), and the Office of Research Ethics and Integrity at the University of Ottawa reviewed and approved.

Discussion
As illustrated in Figure our model is able to capture the trend and predict the MDS-UPDRS scores for a period of 4 years at discrete visit points of V02 (6 months), V04 (12 months), V06 (24 months), V08 (36 months), and V10 (48 months). For the sake of simplicity, we presented the disease progression graphs for the patients in the test dataset with 5 visits.
For almost all the test patients, the disease progression predicted by our model aligns with the ground truth disease progression values. This confirms that our proposed model was able to learn the medical patterns and granular features from the longitudinal data of the transcriptome of the patient. As per the 5-CV model evaluation, we know that the predictive model has an RMSE of 6.01 in predicting the MDS-UPDRS score and a rank correlation of (PIC = 0.83, p-value = 0.0001).
As illustrated in Table , there is a significant difference in the RMSE obtained by our proposed model as compared to the classical machine learning methods of Linear Regression, SVM, Decision Trees, and Random Forests. Thus, our model outperforms all the baseline methods in terms of RMSE and PIC in predicting the MDS-UPDRS score. Furthermore, there is a significant difference in the correlation obtained by our proposed model as compared to the baseline methods. As illustrated in Table , the improvements in the performance of our model as compared to the performance of baseline methods are statistically significant at a significance level of 0.05.
As indicated in Table , we wanted to know the impact of adding non-transcriptomic input features such as Motor Assessment, Subject Characteristics, General Exam, Age, Time Interval to the RNA-Seq features while training the proposed model. It will be interesting to discuss the acceptability of RMSE of 6.01 in real-world applications. While MDS-UPDRS is a highly valid and reliable instrument to capture the true phenotype 23,24 , it is susceptible to "noise" 25 . The two major sources of noise are measurement error (inter and intra-rater variability) and short-term effects (mood, stress, climate, time of the day, etc.) that are irrelevant to the overall progression of the disease 24 . It is reasonable to state that if the RMSE of 6.01 in predicting the MDS-UPDRS is lesser than or equal to the noise inherent in estimating the true MDS-UPDRS score then the predictive model has an acceptable error for real-world applications.
To this end, as per the study 25 , the authors calculated the error variance due to noise in estimating the true value of MDS-UPDRS score and reported an error variance of 4.87 in an estimating score of part-I, 3.66 in part-II, and 15.52 in Part-III of MDS-UPDRS. The total noise component of the MDS-UPDRS score (part-I, part-II, and part-III) is thus 8.059. As the RMSE of 6.01 in predicting the total MDS-UPDRS score by our predictive model is lesser than the inherent noise of 8.059 in the instrument itself, it is safe to assume that the performance of the predictive model is acceptable for real-world applications. Moreover, as per the recent study 26,27 in predictive modeling of PD using imaging genetics on a combination of DNA genotyping and neuroimaging, the authors reported an RMSE of 7.82 in predicting MDS-UPDRS-Score. This value of RMSE is comparable to the RMSE of our predictive model that uses transcriptomic data.

Conclusion
To the best of our knowledge, this study is the first to investigate the usefulness of Omics data such as RNA-Seq in the predictive modeling of the PD Disease Progression using densely connected deep RNNs. The contributions of this study are as follows: 1. We proposed a deep RNN structure that can predict the future year's MDS-UPDRS score of a patient by taking the inputs of the previous year's RNA-Sequence data. The model can leverage the temporal patterns in the historical RNA sequence data. 2. The Predictive model is adaptive over time as the model was trained with irregular visit time intervals and the various number of visits. The model is able to predict the MDS-UPDRS score with an RMSE of 6.01 and a rank correlation of 0.83 between the predicted and true values of MDS-UPDRS. 3. We observed that the introduction of Batch Normalization and Dense Connections play an important role in making the multi-layered RNN to learn the features from high dimensional gene expression data.

Data availability
The datasets are available to download from the PPMI on the date: 2018-06-05 and the latest update of all datasets are downloaded on 2020-01-28. In addition, the corresponding author will provide the dataset utilized and/or analyzed for the current work upon reasonable request.